Moved management of shader module creation and lifetime to its own class. #444

softcookiepp · 2026-01-03T22:05:15Z

As the title says, the creation of vk::ShaderModule objects has been moved from kp::Algorithm to kp::Module, a new class.
This will make it easier to implement caching of vk::ShaderModule objects, and to add additional features for shader module metadata handling in the future.

…e implementation

…e implementation Signed-off-by: softcookiepp <[email protected]>

Signed-off-by: softcookiepp <[email protected]>

axsaucedo

I have done an initial review. There is still a way to go: a few inconsistencies need to be addressed, along with caveats on the existing approach and the coding style. The implementation also does not seem complete, as it suggests an optionally owned resource but the PR does not currently introduce the functionality to make it optionally owned. We also need to ensure the new functionality is tested, especially in contexts where these resources are owned, and particularly the memory management / lifecycle aspect.

axsaucedo · 2026-01-07T09:29:19Z

src/include/kompute/Cache.hpp

@@ -0,0 +1,27 @@
+#pragma once


This file is quite random, what is this here for - remove

axsaucedo · 2026-01-07T09:41:54Z

src/include/kompute/Shader.hpp

+	/*
+	 * getter for mShaderModule
+	 */
+	vk::ShaderModule& getShaderModule() { return mShaderModule; }


Should be const and should not return a mutable reference
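
A minimal sketch of the requested getter shape. It assumes a stand-in handle type (`FakeShaderModule`) and a hypothetical class name (`ShaderSketch`) so that it compiles without the Vulkan headers; in the PR the member would be a `vk::ShaderModule`.

```cpp
#include <cstdint>

// Stand-in for vk::ShaderModule (assumption: lets the sketch build
// without Vulkan headers).
struct FakeShaderModule
{
    std::uint64_t handle = 0;
};

// Hypothetical class name, used for illustration only.
class ShaderSketch
{
  public:
    explicit ShaderSketch(FakeShaderModule module)
      : mShaderModule(module)
    {
    }

    // Read-only access: callable on const objects, and callers cannot
    // overwrite the stored handle through the returned reference.
    const FakeShaderModule& getShaderModule() const { return mShaderModule; }

  private:
    FakeShaderModule mShaderModule;
};
```

With this signature, `shader.getShaderModule() = other;` no longer compiles, so the handle can only change through the class itself.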

axsaucedo · 2026-01-07T09:42:13Z

src/include/kompute/Shader.hpp

+public:
+
+	/*
+	 * Constructor accepting a device and a SPIR-V binary


Comment conventions should align to rest of codebase - see manager.hpp

axsaucedo · 2026-01-07T09:43:13Z

src/include/kompute/Shader.hpp

+	// the vulkan device; not owned by this object
+	std::weak_ptr<vk::Device> mDevice;
+
+	// the shader module handle


Superfluous comment, see other classes for examples

Needs clarity on ownership, as this is sometimes an owned resource

axsaucedo · 2026-01-07T09:44:34Z

src/include/kompute/Shader.hpp

+ * Wrapper for Vulkan's shader modules.
+ * The purpose of this is to manage the module lifetime, while
+ * building the groundwork for easily integrating things like
+ * SPIR-V reflection and multiple entry points in the future.


Rewrite as description of class as of now not for future / groundwork

axsaucedo · 2026-01-07T09:49:25Z

src/include/kompute/Shader.hpp

+ * building the groundwork for easily integrating things like
+ * SPIR-V reflection and multiple entry points in the future.
+ */
+class Module : public std::enable_shared_from_this<Module>


Naming convention here is not consistent, should be Shader as per file - module is a generic keyword

axsaucedo · 2026-01-07T09:53:06Z

src/include/kompute/Algorithm.hpp

    bool mFreeDescriptorSet = false;
-    std::shared_ptr<vk::ShaderModule> mShaderModule;
-    bool mFreeShaderModule = false;
+	std::shared_ptr<Module> mModule = nullptr;


This is suggested as optionally owned resource but it's not handled as such, follow the same convention as per the rest of the optionally owned resources
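
For context, the removed lines in the diff above pair the handle with an `mFree...` flag. A rough sketch of how that pairing could be carried over, assuming a stand-in handle type and hypothetical method names (`useExternal`, `createOwned`, `handle`) that are not part of the PR:

```cpp
#include <memory>
#include <utility>

// Stand-in for a Vulkan handle (assumption, not the real vk:: type).
struct FakeHandle
{
    bool destroyed = false;
};

// Hypothetical class illustrating the "handle + free flag" pairing.
class AlgorithmSketch
{
  public:
    // Adopt a handle created elsewhere: the caller stays responsible for it.
    void useExternal(std::shared_ptr<FakeHandle> handle)
    {
        mShaderModule = std::move(handle);
        mFreeShaderModule = false;
    }

    // Create the handle internally: this object must release it.
    void createOwned()
    {
        mShaderModule = std::make_shared<FakeHandle>();
        mFreeShaderModule = true;
    }

    std::shared_ptr<FakeHandle> handle() const { return mShaderModule; }

    void destroy()
    {
        // Only release what this object created itself.
        if (mFreeShaderModule && mShaderModule) {
            mShaderModule->destroyed = true; // stands in for the device destroy call
        }
        mShaderModule = nullptr;
        mFreeShaderModule = false;
    }

  private:
    std::shared_ptr<FakeHandle> mShaderModule;
    bool mFreeShaderModule = false;
};
```

The flag records who created the resource, which a bare `std::shared_ptr` member cannot express on its own.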

axsaucedo · 2026-01-07T09:55:03Z

src/Shader.cpp

+				 spv.size());
+	vk::ShaderModuleCreateInfo shaderModuleInfo(vk::ShaderModuleCreateFlags(),
+		sizeof(uint32_t) * spv.size(), spv.data());
+	this->mDevice.lock()->createShaderModule(


Currently we don't use device.lock throughout the codebase, let's ensure consistency

axsaucedo · 2026-01-07T09:56:23Z

src/include/kompute/Shader.hpp

+class Module : public std::enable_shared_from_this<Module>
+{
+	// the vulkan device; not owned by this object
+	std::weak_ptr<vk::Device> mDevice;


Device should not be a weak_ptr, as we need to ensure that this resource doesn't outlive the device lifecycle, by storing it as a strong reference and/or ensuring this is handled correctly
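
A small sketch of the strong-reference option, assuming a stand-in `FakeDevice` type and a hypothetical `ModuleSketch` class. Holding a `std::shared_ptr` means the destructor never has to ask whether the device has expired, which is what the `mDevice.lock()` and `mDevice.expired()` calls in the PR work around.

```cpp
#include <memory>
#include <utility>

// Stand-in for vk::Device (assumption); counts modules that are still alive.
struct FakeDevice
{
    int liveModules = 0;
};

// Hypothetical wrapper that keeps its device alive for as long as it exists.
// The sketch assumes a non-null device is passed in.
class ModuleSketch
{
  public:
    explicit ModuleSketch(std::shared_ptr<FakeDevice> device)
      : mDevice(std::move(device))
    {
        ++mDevice->liveModules; // stands in for creating the shader module
    }

    ~ModuleSketch()
    {
        // The strong reference guarantees the device is still valid here.
        --mDevice->liveModules; // stands in for destroying the shader module
    }

    ModuleSketch(const ModuleSketch&) = delete;
    ModuleSketch& operator=(const ModuleSketch&) = delete;

  private:
    std::shared_ptr<FakeDevice> mDevice;
};
```

The trade-off is that the device now lives until the last module is gone, so teardown order is decided by reference counts rather than by the manager alone.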

axsaucedo · 2026-01-07T09:58:15Z

src/Shader.cpp

+{
+	KP_LOG_DEBUG("Kompute Module destructor started");
+	KP_LOG_DEBUG("Kompute Module Destroying shader module");
+	if (!mDevice.expired() )


Ensure consistency with other classes, including coding style

softcookiepp · 2026-01-08T18:55:15Z

Alright, that should cover everything.

axsaucedo

Thanks @softcookiepp - all tests are failing, let's make sure tests pass locally before re-submitting

softcookiepp · 2026-01-12T16:37:30Z

Right now I cannot even get the tests to run on my system due to this issue: #445
I will likely have to fix this as well before merging; is that alright?

axsaucedo · 2026-01-12T16:40:31Z

Oh right I see, you could also try running it with act https://github.com/nektos/act, as the CI runs with SwiftShader (https://github.com/KomputeProject/kompute/blob/master/docker-builders/Swiftshader.Dockerfile); or you can just set up SwiftShader, which allows you to run against CPU instead.

softcookiepp · 2026-01-12T16:44:59Z

Thanks! Will do.

softcookiepp · 2026-01-12T19:54:22Z

Alright, I got all the tests passing. I also changed the vk::Device queue family selection criteria from just vk::QueueFlagBits::eCompute to (vk::QueueFlagBits::eCompute | vk::QueueFlagBits::eTransfer). In practice, most queue families supporting vk::QueueFlagBits::eCompute will also support vk::QueueFlagBits::eTransfer, but there are always edge cases. It is never a good idea to assume the driver will take care of everything in Vulkan; the only reason the test with multiple queues passed at all is because you got lucky with the way your drivers handle it!

The queue family selection in general could use some reworking to prevent invalid API usage (see here: #445 (comment)) but I will leave that for another day.
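
The selection criterion described above can be sketched as a mask test. This is an illustration only: the flags are plain integers whose values mirror VK_QUEUE_GRAPHICS_BIT, VK_QUEUE_COMPUTE_BIT and VK_QUEUE_TRANSFER_BIT, and `findQueueFamily` is a hypothetical helper, not a Kompute or Vulkan function.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Stand-ins for vk::QueueFlagBits (assumption: avoids the Vulkan headers).
enum QueueFlag : std::uint32_t
{
    kGraphics = 0x1,
    kCompute = 0x2,
    kTransfer = 0x4,
};

// Returns the index of the first family whose flags contain every
// required bit, or std::nullopt if no family qualifies.
std::optional<std::uint32_t>
findQueueFamily(const std::vector<std::uint32_t>& familyFlags,
                std::uint32_t required)
{
    for (std::uint32_t i = 0; i < familyFlags.size(); ++i) {
        // Compare against the full mask: a plain non-zero test would
        // accept families that match only one of the required bits.
        if ((familyFlags[i] & required) == required) {
            return i;
        }
    }
    return std::nullopt;
}
```

One caveat when tightening the mask: the Vulkan specification makes reporting the transfer bit optional for families that already report compute or graphics, so requiring it explicitly may also skip families that can in fact run transfer commands.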

axsaucedo

Further changes requested - some strange changes that don't have anything to do with the PR, let's avoid this. Let's also make sure that this code is tested by evaluating an Algo rebuild as this would trigger the full cycle.

axsaucedo · 2026-01-13T16:02:57Z

test/TestAsyncOperations.cpp

    for (uint32_t i = 0; i < numParallel; i++) {
-        inputsAsyncB.push_back(mgr.tensor(data));
-        algosAsync.push_back(mgr.algorithm({ inputsAsyncB[i] }, spirv));
+        inputsAsyncB.push_back(mgrAsync.tensor(data));


Why are you changing this?

The test was wrong and was causing my driver to crash. It was trying to use different kp::Manager (and therefore different underlying vk::Device) instances for allocating buffers, creating pipelines and executing them. You can't do this, even if the underlying hardware is the same. If you read the Vulkan validation errors while running the tests, you would see this:

[Jan 8 2026 11:08:47] [debug] [Manager.cpp:42] [VALIDATION]: Validation - vkCmdPipelineBarrier(): pBufferMemoryBarriers[0].buffer (VkBuffer 0x280000000028) was created, allocated or retrieved from VkDevice 0x5a2129296950, but command is using (or its dispatchable parameter is associated with) VkDevice 0x5a2129386e80

My change here fixes this.

It still isn't completely fixed; see here for details: #445 (comment)

But this is not "fixing" anything. This is a test that is currently set up on specific hardware, which is testing parallel execution. What you are doing is just not running the test. This is not "fixing" something.

It still crashes, but the validation error in my message no longer appears when I attempt to run the test. So it is a step in the right direction towards compliant API usage, but not a complete fix.

I need to see what this validation error may be, but your change basically makes this test redundant, so it's not correct. If you read through the test, your change makes it not compare anything; the test is profiling a parallel queue against a non-parallel queue on a specific GPU that actually supports parallel processing (not async, parallel). Here's more info: https://medium.com/data-science/parallelizing-heavy-gpu-workloads-via-multi-queue-operations-50a38b15a1dc

The simplified reasoning is that you are taking an underlying vk::PhysicalDevice and creating two different vk::Device instances with it, since each kp::Manager creates its own vk::Device. Though these may have the same physical GPU, the Vulkan driver may initialize separate resources every time a vk::Device is created for a given GPU.

What do these resources look like? It really depends on the driver. NVIDIA drivers (which given your comments, it seems you have) are more robust in this regard, while AMD drivers (which I am using right now) are quite a bit more unforgiving about things like this.

So when you try to use memory and pipelines allocated with mgr (which has its own vk::Device instance) with mgrAsync (which has a totally separate vk::Device), you are violating one of the core assumptions that Vulkan drivers make. You may as well be asking for it to mix and match resources created on different physical GPUs.

As for why the test is redundant now, I am really not sure; both mgr and mgrAsync are still used in this test. I just made sure mgrAsync did not accept tensors and algorithms created with mgr. If you could point to exactly which lines break the test and why, that would be very much appreciated.
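
The device-mismatch rule described above can be illustrated with a guard that refuses resources created by a different manager. Everything here is hypothetical (`ManagerSketch`, `FakeTensor`, `record`); nothing in this thread indicates that kp::Manager performs such a check.

```cpp
#include <memory>
#include <stdexcept>

// Stand-in for vk::Device (assumption).
struct FakeDevice
{
};

// Stand-in for a tensor that remembers which device allocated it.
struct FakeTensor
{
    std::shared_ptr<FakeDevice> device;
};

// Hypothetical manager: each instance creates its own logical device,
// mirroring the statement that every kp::Manager creates its own vk::Device.
class ManagerSketch
{
  public:
    ManagerSketch()
      : mDevice(std::make_shared<FakeDevice>())
    {
    }

    FakeTensor tensor() const { return FakeTensor{ mDevice }; }

    // Rejects resources that were created through another manager.
    void record(const FakeTensor& t) const
    {
        if (t.device != mDevice) {
            throw std::invalid_argument("tensor belongs to a different device");
        }
    }

  private:
    std::shared_ptr<FakeDevice> mDevice;
};
```

In this sketch, `mgrAsync.record(mgr.tensor())` throws, while `mgrAsync.record(mgrAsync.tensor())` succeeds, which matches the intent of the change to TestAsyncOperations.cpp.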

axsaucedo

Added a few more comments but seems almost there. Are there any tests that we can add as well? Seems to me this segfault piece would benefit from one. We also need to make sure that the documentation is updated. Once these are resolved we should be able to merge. Thank you.

axsaucedo · 2026-01-15T18:45:38Z

Thanks @softcookiepp - almost there, I just added a follow-up reply to clarify the tests

softcookiepp and others added 4 commits January 3, 2026 14:02

moved shader module creation to its own class in preparation for cach…

c75c161

…e implementation

moved shader module creation to its own class in preparation for cach…

74841ce

…e implementation Signed-off-by: softcookiepp <[email protected]>

Merge branch 'master' of github.com:softcookiepp/kompute

5d737b9

Signed-off-by: softcookiepp <[email protected]>

Removed Cache.cpp, as it will not be used.

6fdc886

Signed-off-by: softcookiepp <[email protected]>

axsaucedo requested changes Jan 8, 2026

softcookiepp added 2 commits January 8, 2026 10:24

completely forgot to remove Cache.hpp, as it is not used

acba5b8

as you wish

64edc39

softcookiepp requested a review from axsaucedo January 8, 2026 19:43

axsaucedo requested changes Jan 12, 2026

softcookiepp and others added 3 commits January 12, 2026 09:17

Fixed incorrect kp::Manager usage in TestAsyncOperations.cpp

8eb39d5

Merge branch 'master' into master

97a6ea1

Tests pass now

ea0889a

axsaucedo requested changes Jan 13, 2026

removed mSpirv from kp::Algorithm

47d3a41

axsaucedo requested changes Jan 15, 2026

revered a silly change made based on silly assumptions

30b2277
